Issue 320: support clean shutdown #953

benoit74 · 2020-03-24T17:54:40Z

Tries to solve #320

Feedback, code reviews and suggestions are more than welcome.

The code is functional (i.e. it has been tested in live conditions inside a K8S cluster).

drain = start a clean shutdown, i.e. terminate ongoing commands and refuse to start new ones.

Changes:

new endpoint "POST /drain" on localhost:4141 to start the drain
new endpoint "GET /drain" on localhost:4141 to get the status of the drain
new command "atlantis drain" to start the drain and wait until its status is completed (typically intended to be called from a preStop hook when running inside a K8S cluster)
when a user tries to start a command while a drain is ongoing, the action is ignored and a message is displayed in their VCS: "Atlantis server is shutting down, please try again later."
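
As an illustration of the drain semantics described above (refuse new commands, let ongoing ones finish), here is a minimal Go sketch. It is not the code from this PR: the `Drainer` type and its method names are hypothetical and only show the bookkeeping involved.

```go
package main

import (
	"fmt"
	"sync"
)

// Drainer is a hypothetical helper, not the implementation from this PR.
// It tracks running commands and refuses new ones once a drain has started.
type Drainer struct {
	mu       sync.Mutex
	draining bool
	inFlight int
}

// StartOp registers a new command. It returns false once a drain has started.
func (d *Drainer) StartOp() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.draining {
		return false
	}
	d.inFlight++
	return true
}

// OpDone marks a previously started command as finished.
func (d *Drainer) OpDone() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.inFlight--
}

// StartDrain switches to draining mode; commands already running are left alone.
func (d *Drainer) StartDrain() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.draining = true
}

// Status reports whether a drain was requested and how many commands still run.
func (d *Drainer) Status() (draining bool, inFlight int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.draining, d.inFlight
}

func main() {
	d := &Drainer{}
	fmt.Println(d.StartOp()) // true: accepted before the drain
	d.StartDrain()
	fmt.Println(d.StartOp()) // false: refused during the drain
	d.OpDone()
	fmt.Println(d.Status()) // true 0: drain requested, nothing left running
}
```

In this sketch the drain is complete once `Status` reports a drain in progress with zero commands in flight.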

benoit74 · 2020-03-24T18:01:09Z

I just broke some tests, which are no longer initialized correctly now that I have added some dependencies. I will fix this.

codecov · 2020-03-24T19:49:55Z

Codecov Report

Merging #953 into master will increase coverage by 0.05%.
The diff coverage is 76.00%.

@@            Coverage Diff             @@
##           master     #953      +/-   ##
==========================================
+ Coverage   71.98%   72.03%   +0.05%     
==========================================
  Files          65       68       +3     
  Lines        5411     5486      +75     
==========================================
+ Hits         3895     3952      +57     
- Misses       1210     1225      +15     
- Partials      306      309       +3

Impacted Files	Coverage Δ
cmd/drain.go	`0.00% <0.00%> (ø)`
server/events/command_runner.go	`50.58% <71.42%> (+1.21%)`	⬆️
server/server.go	`64.03% <77.77%> (+0.37%)`	⬆️
server/drain_controller.go	`80.00% <80.00%> (ø)`
server/events/drainer.go	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 12801dd...fa64779.

benoit74 · 2020-03-24T21:25:18Z

I'm done with the CI, ready for review.

benoit74 · 2020-03-26T07:52:46Z

@lkysow: Could you please have a look at this or assign someone?
PS: the chart update is ready as well (https://github.com/benoit74/charts/tree/atlantis_gracefull_shutdown), but I will wait for this to be merged and the new docker image to be released before creating the chart PR, to avoid releasing the chart update before the binary is ready.

lkysow · 2020-03-27T19:16:45Z

This looks great! I'm wondering if we need the drain command though? Can the POST endpoint block until the drain is complete? Then the lifecycle hook can just be curl -X POST localhost:4141/drain.

lkysow · 2020-03-27T19:17:23Z

I'm also wondering if there are some other tools out there that follow this pattern that I can look at?

benoit74 · 2020-03-27T19:20:19Z

Kubernetes is draining nodes of running pods as well, e.g. for performing an update of the kubelet, the OS, hardware failures, ... No idea where this is implemented in the codebase to be honest.

lkysow · 2020-03-27T20:51:54Z

Kubernetes is draining nodes of running pods as well, e.g. for performing
an update of the kubelet, the OS, hardware failures, ... No idea where this
is implemented in the codebase to be honest.

I meant other tools that run on Kubernetes and how they solve this problem. For example, I wonder if they have a separate command or just an endpoint that blocks.

MRinalducci · 2020-03-28T00:08:49Z

The hook handler implementation in Kubernetes also supports HTTP, but only with the GET method.
This seems to be a common way to do it too:

apiVersion: v1
kind: Pod
metadata:
  name: client
spec:
  containers:
  - image: mynginx
    name: client
    lifecycle:
      preStop:
        httpGet:
          port: 8080
          path: /shutdown

If we need the POST method, this has to be done with an exec command + curl, like you already mentioned.
I wouldn't implement the command if it adds no value over calling the endpoint directly. Maybe the sleep part of the command, i.e. returning only when the drain has completed?

https://github.com/benoit74/atlantis/blob/72e54dca0df83a18bf19bc42270e69877defc7db/drain/drain.go#L29


benoit74 · 2020-03-28T12:05:49Z

Ah, OK, I did not get your question right.

Below is what I found by looking at preStop hooks in helm charts (not many, to be honest).

cassandra: nodetool decommission
elasticsearch: pre-stop-hook.sh (I could not find what is inside the script)
etcd: shell script performing a few checks, some data extraction (member name, ...) and finally etcdctl member remove
kong: kong quit (which in turn performs a few checks, sends a signal to nginx, and force kills if needed)
nginx: use of signals (but this is generalized to all interaction with nginx, e.g. reloading the configuration is also done with a signal, ...)
graylog: curl -X POST http://localhost:9000/api/system/shutdown/shutdown

@MRinalducci
The added value of "atlantis drain" is indeed that it waits for the drain to complete. This is what k8s expects, i.e. the preStop command must not return before the pod can be terminated. Since this could take many minutes, I was quite uncomfortable implementing this "wait" in an HTTP method (long-lived HTTP connections are usually not that reliable).

MRinalducci · 2020-04-29T11:16:54Z

Hi @lkysow any news about this PR? 😄

lkysow · 2020-04-29T16:53:45Z

Hey I've just been too busy to tackle this. I'll get to it when I can.

MRinalducci · 2020-04-29T17:05:59Z

Ok no problem, thanks for the update 👍


lkysow · 2020-05-25T22:45:07Z

Hi @benoit74, I've merged this as part of #1051; however, I made a couple of changes:

I removed the POST /drain endpoint and am using SIGTERM/SIGINT instead. I did this because a) I was worried about the security of having the /drain endpoint exposed and b) in my testing, using the signal works with Kubernetes and terminationGracePeriodSeconds.
I removed the atlantis drain command since it's not necessary anymore.
I renamed the drain controller to status controller and changed the endpoint to /status so it can be a generic status endpoint.

I would have loved to work with you on your code via reviews rather than coding on top of it. However, I don't have as much time as I'd like for code reviews, you opened this a long time ago, and I know it's annoying to have to reload all the context.

Hopefully this mechanism works for you and if there are any issues then please let me know and we can look to fix.

benoit74 · 2020-05-26T06:10:38Z

Hi @lkysow
Thank you very much for this work.
I see no reason why this wouldn't work for us; we will let you know if we face any issues.
I'm totally fine with this approach as well, and your arguments make a lot of sense to me.
No worries about the fact that we didn't work together on this, you got it right.
Do you have any idea of a timeframe for when this will be released? (so that we can stop deploying a custom build ;o))
Thanks in advance

lkysow · 2020-05-26T16:42:21Z

I'll get a release out this week for sure.

benoit74 added 5 commits May 5, 2020 16:15

Add drain operation + endpoint to clean properly the server before shutdown (e64fbad)
Move drainage checks to the async command runner (6d42a56)
small fixes (46a1fe0)
Add even more tests (c8bbad6)
Fix failing test (fa64779)

lkysow mentioned this pull request May 25, 2020

Support graceful shutdown #1051

Merged

lkysow merged commit fa64779 into runatlantis:master May 25, 2020
